Splice-Junction Gene Sequences


This HyperNext Creator project shows how to implement a neural network that can be used to recognise three categories of DNA sequence: exon/intron borders, intron/exon borders, and neither of these canonical patterns.

--------
Data set

The dataset is symbolic and must be converted to a form the neural network can accept.



The first two lines from a typical dataset file are



EIX

GCGGGGTCGCTAAGGCCTCAGGAGGAGAAATGGCTCTCTGCAACCAGTTCTCTGCATCACE



The first line, the file header, lists the three output classes:

   E - exon/intron border,
   I - intron/exon border,
   X - neither.

The second line consists of 60 symbols from the set A, C, G, T, D, N, S, R, followed by a single E, I or X giving the classification for this sequence.
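The file layout described above can be read with a few lines of Python. This is only a sketch and not the project's own HyperNext code: the names parse_dataset and VALID_SYMBOLS are hypothetical, and skipping blank lines is an assumption made here for convenience.

```python
VALID_SYMBOLS = set("ACGTDNSR")  # the 8 input symbols listed above

def parse_dataset(text):
    """Return (classes, records); each record is a (sequence, label) pair."""
    # splitlines() copes with Macintosh (CR) and UNIX (LF) line endings.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    classes = list(lines[0])  # header line, e.g. "EIX" -> ['E', 'I', 'X']
    records = []
    for ln in lines[1:]:
        seq, label = ln[:-1], ln[-1]  # 60 symbols, then the class letter
        if len(seq) != 60:
            raise ValueError("expected 60 symbols, got %d" % len(seq))
        if not set(seq) <= VALID_SYMBOLS:
            raise ValueError("unexpected symbol in sequence: " + seq)
        if label not in classes:
            raise ValueError("unknown class label: " + label)
        records.append((seq, label))
    return classes, records

# The two example lines shown earlier in this document.
sample = (
    "EIX\n"
    "GCGGGGTCGCTAAGGCCTCAGGAGGAGAAATGGCTCTCTGCAACCAGTTCTCTGCATCACE\n"
)
classes, records = parse_dataset(sample)
print(classes, len(records), records[0][1])
```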



In addition to the standard DNA symbols, D, N, S and R represent don't cares as defined below:
    D =  A or G or T
    N =  A or C or G or T
    S =  C or G
    R =  A or G

Within the project, this mapping is coded in the MakeMapInputs procedure defined in the MAINCODE section.


------
Coding



Each of the 60 inputs is coded as 4 values, resulting in 240 inputs to the neural network.


   A = 1 0 0 0
   C = 0 1 0 0
   G = 0 0 1 0
   T = 0 0 0 1

The don't cares are coded probabilistically based on A C G T

   D = 0.33 0 0.33 0.33
   N = 0.25 0.25 0.25 0.25
   S = 0 0.5 0.5 0
   R = 0.5 0 0.5 0

The outputs are coded down from 3 outputs to 2 outputs and are represented by



   E = 1 0
   I = 0 1
   X = 0 0
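The input and output codings can be sketched in Python as follows. This is an illustration only, not the project's MakeMapInputs procedure (which is written in HyperNext); the names AMBIGUITY, encode_symbol, encode_sequence and OUTPUT_CODES are hypothetical. Each symbol spreads a total weight of 1 evenly over the bases it may stand for, so D yields 1/3 per allowed base, which is where the rounded 0.33 comes from.

```python
BASES = "ACGT"  # order of the 4 values produced for each symbol

# Which bases each symbol may stand for (don't cares as defined above).
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "D": "AGT", "N": "ACGT", "S": "CG", "R": "AG",
}

# Three classes coded onto two outputs.
OUTPUT_CODES = {"E": (1, 0), "I": (0, 1), "X": (0, 0)}

def encode_symbol(symbol):
    """Return 4 values: equal weight on each allowed base, 0 elsewhere."""
    allowed = AMBIGUITY[symbol]
    weight = 1.0 / len(allowed)
    return [weight if base in allowed else 0.0 for base in BASES]

def encode_sequence(sequence):
    """Concatenate the 4-value codes of every symbol in the sequence."""
    values = []
    for symbol in sequence:
        values.extend(encode_symbol(symbol))
    return values

inputs = encode_sequence("ACGTDNSR" * 7 + "ACGT")  # an arbitrary 60-symbol sequence
print(len(inputs))          # 60 symbols x 4 values = 240
print(encode_symbol("S"))   # [0.0, 0.5, 0.5, 0.0]
print(OUTPUT_CODES["I"])    # (0, 1)
```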


--------------------
Training and Testing



Three datasets are provided, each already shuffled, and named as follows:

   DNA tiny - 10 sequences
   DNA small - 200 sequences
   DNA large - 3190 sequences


When training the neural network it is recommended that first-time users experiment with the DNA tiny dataset, as the larger ones can be quite time consuming.

The setup screen allows a dataset file to be loaded, shuffled and then divided into training and testing sections. For instance, a 40% value indicates that 40% of the loaded file will be used for training and the remaining 60% used for testing. The neural network can also be tested on the training data.
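The shuffle-and-divide step can be sketched in Python as below. The function name shuffle_and_split, the seed parameter and the rounding of the training count are all assumptions made for illustration; the project's setup screen may round differently.

```python
import random

def shuffle_and_split(records, train_percent, seed=None):
    """Shuffle a copy of records and split it into (training, testing)."""
    rng = random.Random(seed)   # a fixed seed makes the split repeatable
    shuffled = list(records)    # leave the caller's list untouched
    rng.shuffle(shuffled)
    # Assumption: the training count is rounded to the nearest whole record.
    n_train = round(len(shuffled) * train_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]

# Ten placeholder records, standing in for the DNA tiny dataset.
tiny = ["seq%d" % i for i in range(10)]
train, test = shuffle_and_split(tiny, 40, seed=1)
print(len(train), len(test))    # 4 6
```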


--------------
Project itself



The project can be freely modified and shows various aspects of using HyperNext Creator and the BP1 neural network plugin. The code is not optimised and could be greatly improved, especially to make the mapping more flexible and expandable.



Note

  The project as set up can cope with dataset files that have either Macintosh or UNIX line endings.
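For readers preprocessing dataset files outside HyperNext, one simple way to accept both kinds of line ending is to normalise them before splitting into lines. The sketch below is not taken from the project; normalise_line_endings is a hypothetical name. Classic Macintosh files end lines with a carriage return (CR), UNIX files with a line feed (LF). The sketch also handles CR+LF pairs, which the note above does not mention.

```python
def normalise_line_endings(text):
    """Convert CR+LF and lone CR line endings to LF."""
    # Replace CR+LF first so a pair does not become two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")

mac_text = "EIX\rACGT\r"    # classic Macintosh line endings
unix_text = "EIX\nACGT\n"   # UNIX line endings
print(normalise_line_endings(mac_text) == unix_text)   # True
```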

  
When training the neural network, the Escape key can be used to abort the training, but on very slow machines there can be a substantial delay between pressing it and the training aborting.

